Abstract

Recently, machine learning algorithms have successfully
entered large-scale real-world industrial applications
(e.g. search engines and email spam filters). Here, the
CPU cost during test-time must be budgeted and accounted for. In
this paper, we address the challenge of balanc- 
ing the test-time cost and the classifier accuracy 
in a principled fashion. The test-time cost of a 
classifier is often dominated by the computation 
required for feature extraction — which can vary 
drastically across features. We decrease this ex- 
traction time by constructing a tree of classifiers, 
through which test inputs traverse along individ- 
ual paths. Each path extracts different features 
and is optimized for a specific sub-partition of 
the input space. By only computing features for 
inputs that benefit from them the most, our cost- 
sensitive tree of classifiers can match the high ac- 
curacies of the current state-of-the-art at a small 
fraction of the computational cost. 

1. Introduction 

Machine learning algorithms are widely used in many real-world applications, ranging from email-spam (Weinberger et al., 2009) and adult content filtering (Fleck et al., 1996), to web-search engines (Zheng et al., 2008). As machine learning transitions into these industry fields, managing the CPU cost at test-time becomes increasingly important. In applications of such large scale, computation must be budgeted and accounted for. Moreover, reducing energy wasted on unnecessary computation can lead to monetary savings and reductions of greenhouse gas emissions.

The test-time cost consists of the time required to evaluate a classifier and the time to extract features for that classifier, where the extraction time across features is highly variable. Imagine introducing a new feature to an email spam filtering algorithm that requires 0.01 seconds to extract per incoming email. If a web-service receives one billion emails (which many do daily), it would require roughly 115 extra CPU days to extract just this feature. Although this additional feature may increase the accuracy of the filter, the cost of computing it for every email is prohibitive. This introduces the problem of balancing the test-time cost and the classifier accuracy. Addressing this trade-off in a principled manner is crucial for the applicability of machine learning.
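The CPU-days figure in the example above follows from simple arithmetic. A minimal sketch that reproduces it (the function name is ours, introduced only for illustration):

```python
# Back-of-the-envelope check of the CPU-days figure quoted in the text.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def extra_cpu_days(seconds_per_item: float, num_items: int) -> float:
    """CPU days needed to extract one feature for every incoming item."""
    return seconds_per_item * num_items / SECONDS_PER_DAY


# 0.01 seconds per email, one billion emails.
cpu_days = extra_cpu_days(0.01, 1_000_000_000)
print(f"{cpu_days:.1f} CPU days")  # about 115.7
```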

In this paper, we propose a novel algorithm, Cost-Sensitive Tree of Classifiers (CSTC). A CSTC tree (illustrated schematically in Fig. 1) is a tree of classifiers that is carefully constructed to reduce the average test-time complexity of machine learning algorithms, while maximizing their accuracy. Different from prior work, which reduces the total cost for every input (Efron et al., 2004) or which stages the feature extraction into linear cascades (Viola & Jones, 2004; Lefakis & Fleuret, 2010; Saberian & Vasconcelos, 2010; Pujara et al., 2011; Chen et al., 2012), a CSTC tree incorporates input-dependent feature selection into training and dynamically allocates higher feature budgets for infrequently traveled tree-paths. By introducing a probabilistic tree-traversal framework, we can compute the exact expected test-time cost of a CSTC tree. CSTC is trained with a single global loss function, whose test-time cost penalty is a direct relaxation of this expected cost. This principled approach leads to unmatched test-cost/accuracy tradeoffs as it naturally divides the input space into sub-regions and extracts expensive features only when necessary.

We make several novel contributions: 1. We introduce the meta-learning framework of CSTC trees and derive the expected cost of an input traversing the tree during test-time. 2. We relax this expected cost with a mixed-norm relaxation and derive a single global optimization problem to train all classifiers jointly. 3. We demonstrate on synthetic data that CSTC effectively allocates features to classifiers where they are most beneficial and show on large-scale real-world web-search ranking data that CSTC significantly outperforms the current state-of-the-art in test-time cost-sensitive learning, maintaining the performance of the best algorithms for web-search ranking at a fraction of their computational cost.

2. Related Work 

A basic approach to control test-time cost is the use of $l_1$-norm regularization (Efron et al., 2004), which results in a sparse feature set, and can significantly reduce the feature cost during test-time (as unused features are never computed). However, this approach fails to address the fact that some inputs may be successfully classified by only a few cheap features, whereas others strictly require expensive features for correct classification.

There is much previous work that extends single classifiers
to classifier cascades (mostly for binary classification)
(Viola & Jones, 2004; Lefakis & Fleuret, 2010; Saberian &
Vasconcelos, 2010; Pujara et al., 2011; Chen et al., 2012).
In these cascades, several classifiers are ordered into a se- 
quence of stages. Each classifier can either reject inputs 
(predicting them), or pass them on to the next stage, based 
on the prediction of each input. To reduce the test-time 
cost, these cascade algorithms enforce that classifiers in 
early stages use very few and/or cheap features and reject 
many easily-classified inputs. Classifiers in later stages, 
however, are more expensive and cope with more difficult 
inputs. This linear structure is particularly effective for ap- 
plications with highly skewed class imbalance and generic 
features. One celebrated example is face detection in im- 
ages, where the majority of all image regions do not con- 
tain faces and can often be easily rejected based on the re- 
sponse of a few simple Haar features (Viola & Jones, 2004). 
The linear cascade model is however less suited for learn- 
ing tasks with balanced classes and specialized features. It 
cannot fully capture the scenario where different partitions 
of the input space require different expert features, as all 
inputs follow the same linear chain. 

Grubb & Bagnell (2012) and Xu et al. (2012) focus on training a classifier that explicitly trades off test-time cost and accuracy. Instead of optimizing the trade-off by building a cascade, they push the cost trade-off into the construction of the weak learners. It should be noted that, in spite of the high accuracy achieved by these techniques, the algorithms are based heavily on stage-wise regression (gradient boosting) (Friedman, 2001), and are less likely to work with more general weak learners.

Gao & Koller (2011) use locally weighted regression during test time to predict the information gain of unknown features. Different from our algorithm, their model is learned during test-time, which introduces an additional cost, especially for large data sets. In contrast, our algorithm learns and fixes the tree structure in training and has a test-time complexity that is constant with respect to the training set size.

Hierarchical Mixture of Experts (HME) (Jordan & Jacobs, 
1994) also builds tree-structured classifiers. However, in 
contrast to CSTC, this work is not motivated by reduc- 
tions in test-time cost and results in fundamentally differ- 
ent models. In CSTC, each classifier is trained with the 
test-time cost in mind and each test-input only traverses a 
single path from the root down to a terminal element, ac- 
cumulating path-specific costs. In HME, all test-inputs tra- 
verse all paths and all leaf-classifiers contribute to the final 
prediction, incurring the same cost for all test-inputs. 

Recent tree-structured classifiers include the work of Deng et al. (2011), who speed up the training and evaluation of label trees (Bengio et al., 2010), by avoiding many binary one-vs-all classifier evaluations. Differently, we focus on problems in which feature extraction time dominates the test-time cost, which motivates different algorithmic setups. Possibly most similar to our work is that of Busa-Fekete et al. (2012), who learn a directed acyclic graph via a Markov decision process to select features for different instances. Although similar in motivation, their algorithmic framework is very different and can be regarded as complementary to ours.

3. Cost-sensitive classification 

We first introduce our notation and then formalize our test-time cost-sensitive learning setting. Let the training data consist of inputs $\mathcal{D} = \{x_1, \dots, x_n\} \subset \mathcal{R}^d$ with corresponding class labels $\{y_1, \dots, y_n\} \subseteq \mathcal{Y}$, where $\mathcal{Y} = \mathcal{R}$ in the case of regression ($\mathcal{Y}$ could also be a finite set of categorical labels; because of space limitations we do not focus on this case in this paper).

Non-linear feature space. Throughout this paper, we focus on linear classifiers but in order to allow non-linear decision boundaries we map the input into a non-linear feature space with the "boosting trick" (Friedman, 2001; Chapelle et al., 2011), prior to our optimization. In particular, we first train gradient boosted regression trees with a squared loss penalty (Friedman, 2001), $H'(x_i) = \sum_{t=1}^{T} h_t(x_i)$, where each function $h_t(\cdot)$ is a limited-depth CART tree (Breiman, 1984). We then apply the mapping $x_i \rightarrow \phi(x_i)$ to all inputs, where $\phi(x_i) = [h_1(x_i), \dots, h_T(x_i)]^\top$. To avoid confusion between CART trees and the CSTC tree, we refer to CART trees $h_t(\cdot)$ as weak learners.
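The feature mapping above can be illustrated with a small sketch. The weak learners here are hand-written depth-1 stumps chosen purely for illustration (an assumption on our part; in the paper they are limited-depth CART trees obtained from gradient boosting), and all names are hypothetical.

```python
from typing import Callable, List, Sequence

WeakLearner = Callable[[Sequence[float]], float]


def make_stump(feature: int, threshold: float, left: float, right: float) -> WeakLearner:
    """Depth-1 regression tree: `left` if x[feature] <= threshold, else `right`."""
    def h(x: Sequence[float]) -> float:
        return left if x[feature] <= threshold else right
    return h


def phi(x: Sequence[float], weak_learners: List[WeakLearner]) -> List[float]:
    """Boosting-trick mapping: phi(x) = [h_1(x), ..., h_T(x)]."""
    return [h(x) for h in weak_learners]


def linear_predict(x: Sequence[float], weak_learners: List[WeakLearner],
                   beta: Sequence[float]) -> float:
    """Linear classifier in the mapped space: phi(x)^T beta."""
    return sum(b * v for b, v in zip(beta, phi(x, weak_learners)))


# Hypothetical weak learners; a real pipeline would learn them by boosting.
learners = [
    make_stump(0, 0.0, -1.0, 1.0),
    make_stump(1, 0.5, 0.0, 2.0),
    make_stump(0, 2.0, 0.5, -0.5),
]
x = [1.0, 0.2]
features = phi(x, learners)                             # [1.0, 0.0, 0.5]
boosted = linear_predict(x, learners, [1.0, 1.0, 1.0])  # H'(x) = sum_t h_t(x)
```

With an all-ones weight vector the linear classifier reproduces the boosted ensemble $H'(x)$; a sparse weight vector uses only a subset of the weak learners, which is what the cost terms below exploit.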

Risk minimization. At each node in the CSTC tree we propose to learn a linear classifier in this feature space, $H(x_i) = \phi(x_i)^\top \beta$ with $\beta \in \mathcal{R}^T$, which is trained to explicitly reduce the CPU cost during test-time. We learn the weight-vector $\beta$ by minimizing a convex empirical risk function $\ell(\phi(x_i)^\top \beta, y_i)$ with $l_1$ regularization, $|\beta|$. In addition, we incorporate a cost term $c(\beta)$, which we derive in the following subsection, to restrict test-time cost. The combined test-time cost-sensitive loss function becomes

$$\mathcal{L}(\beta) = \underbrace{\sum_i \ell(\phi(x_i)^\top \beta, y_i) + \rho|\beta|}_{\text{regularized risk}} + \lambda\, c(\beta), \tag{1}$$

where $\lambda$ is the accuracy/cost trade-off parameter, and $\rho$ controls the strength of the regularization.

Test-time cost. There are two factors that contribute to the test-time cost of each classifier: the weak learner evaluation cost of all active $h_t(\cdot)$ (with $|\beta_t| > 0$) and the feature extraction cost for all features used in these weak learners. We assume that features are computed on demand with the cost $c$ the first time they are used, and are free for future use (as feature values can be cached). We define an auxiliary matrix $F \in \{0,1\}^{d \times T}$ with $F_{\alpha t} = 1$ if and only if the weak learner $h_t$ uses feature $f_\alpha$. Let $e_t > 0$ be the cost to evaluate a $h_t(\cdot)$, and $c_\alpha$ be the cost to extract feature $f_\alpha$. With this notation, we can formulate the total test-time cost for an instance precisely as

$$c(\beta) = \underbrace{\sum_t e_t \|\beta_t\|_0}_{\text{evaluation cost}} + \underbrace{\sum_\alpha c_\alpha \Big\| \sum_t |F_{\alpha t}\beta_t| \Big\|_0}_{\text{feature extraction cost}}, \tag{2}$$

where the $l_0$ norm for scalars is defined as $\|a\|_0 \in \{0, 1\}$ with $\|a\|_0 = 1$ if and only if $a \neq 0$. The first term assigns cost $e_t$ to every weak learner used in $\beta$, the second term assigns cost $c_\alpha$ to every feature that is extracted by at least one of such weak learners.
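The exact cost in (2) is a simple counting rule, which the following sketch spells out on a toy instance. The function name and all numerical values are hypothetical and serve only to illustrate the two terms.

```python
from typing import List, Sequence


def exact_test_cost(beta: Sequence[float], eval_cost: Sequence[float],
                    feat_cost: Sequence[float], F: List[List[int]]) -> float:
    """Exact cost: pay e_t per active weak learner and c_alpha per needed feature."""
    active = [b != 0.0 for b in beta]
    # Evaluation cost: every weak learner with non-zero weight is evaluated once.
    evaluation = sum(e for e, on in zip(eval_cost, active) if on)
    # Extraction cost: a feature is paid for once if any active weak learner uses it.
    extraction = sum(
        c for c, row in zip(feat_cost, F)
        if any(used and on for used, on in zip(row, active))
    )
    return evaluation + extraction


# Toy instance: rows of F are features, columns are weak learners.
F = [[1, 0, 1],   # feature 0 is used by weak learners 0 and 2
     [0, 1, 0],   # feature 1 is used by weak learner 1
     [0, 0, 1]]   # feature 2 is used by weak learner 2
beta = [0.7, 0.0, -0.2]      # weak learner 1 is inactive
e = [1.0, 1.0, 1.0]
c = [5.0, 20.0, 50.0]
cost = exact_test_cost(beta, e, c, F)  # 2 evaluations + features 0 and 2 = 57.0
```

Note that feature 0 is shared by two active weak learners but is paid for only once, and the magnitude of the weights does not matter, only whether they are zero.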

Figure 1. A schematic layout of a CSTC tree. Each node $v^k$ has a threshold $\theta^k$ to send instances to different parts of the tree and a weight vector $\beta^k$ for prediction. We solve for $\beta^k$ and $\theta^k$ that best balance the accuracy/cost trade-off for the whole tree. All paths of a CSTC tree are shown in color.

Test-cost relaxation. The cost formulation in (2) is exact but difficult to optimize as the $l_0$ norms are non-continuous and non-differentiable. As a solution, throughout this paper we use the mixed-norm relaxation of the $l_0$ norm over sums,

$$\sum_j \Big\| \sum_i |A_{ij}| \Big\|_0 \;\rightarrow\; \sum_j \sqrt{\sum_i (A_{ij})^2}, \tag{3}$$

described by Kowalski (2009). Note that for a single element this relaxation relaxes the $l_0$ norm to the $l_1$ norm, $\|A_{ij}\|_0 \rightarrow \sqrt{(A_{ij})^2} = |A_{ij}|$, and recovers the commonly used approximation to encourage sparsity (Efron et al., 2004; Schölkopf & Smola, 2001). We plug the cost-term (2) into the loss in (1) and apply the relaxation (3) to all $l_0$ norms to obtain

$$\min_\beta \underbrace{\sum_i \ell_i + \rho|\beta|}_{\text{regularized risk}} + \lambda \Big( \underbrace{\sum_t e_t |\beta_t|}_{\text{evaluation cost penalty}} + \underbrace{\sum_\alpha c_\alpha \sqrt{\sum_t (F_{\alpha t}\beta_t)^2}}_{\text{feature cost penalty}} \Big), \tag{4}$$

where we abbreviate $\ell_i = \ell(\phi(x_i)^\top \beta, y_i)$ for simplicity. While (4) is cost-sensitive, it is restricted to a single linear classifier. In the next section we describe how to expand this formulation into a cost-effective tree-structured model.
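To make the effect of the relaxation concrete, the sketch below evaluates the relaxed cost penalty that results from applying (3) to the two terms of (2): an $l_1$ term for weak learner evaluation and a square-root (mixed-norm) term per feature. The function name and the toy numbers are hypothetical.

```python
import math
from typing import List, Sequence


def relaxed_test_cost(beta: Sequence[float], eval_cost: Sequence[float],
                      feat_cost: Sequence[float], F: List[List[int]]) -> float:
    """Mixed-norm relaxation of the exact cost: l1 term plus per-feature l2 terms."""
    evaluation = sum(e * abs(b) for e, b in zip(eval_cost, beta))
    extraction = sum(
        c * math.sqrt(sum((f * b) ** 2 for f, b in zip(row, beta)))
        for c, row in zip(feat_cost, F)
    )
    return evaluation + extraction


F = [[1, 0, 1],
     [0, 1, 0],
     [0, 0, 1]]
beta = [0.7, 0.0, -0.2]
e = [1.0, 1.0, 1.0]
c = [5.0, 20.0, 50.0]
relaxed = relaxed_test_cost(beta, e, c, F)
doubled = relaxed_test_cost([2.0 * b for b in beta], e, c, F)
```

Unlike the exact cost, the relaxed penalty grows with the magnitude of the weights (`doubled` is twice `relaxed`). This is the regularization side effect that the fine-tuning step of Section 4.2 is meant to correct.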

4. Cost-sensitive tree 

We begin by introducing foundational concepts regarding 
the CSTC tree and derive a global loss function (5). Similar 
to the previous section, we first derive the exact cost term 
and then relax it with the mixed-norm. Finally, we describe 
how to optimize this function efficiently and to undo some 
of the inaccuracy induced by the mixed-norm relaxations. 

CSTC nodes. We make the assumption that instances with similar labels can utilize similar features.¹ We therefore design our tree algorithm to partition the input space based on classifier predictions. Classifiers that reside deep in the tree become experts for a small subset of the input space and intermediate classifiers determine the path of instances through the tree. We distinguish between two different elements in a CSTC tree (depicted in Figure 1): classifier nodes (white circles) and terminal elements (black squares). Each classifier node $v^k$ is associated with a weight vector $\beta^k$ and a threshold $\theta^k$. Different from cascade approaches, these classifiers not only classify inputs using $\beta^k$, but also branch them by their threshold $\theta^k$, sending inputs to their upper child if $\phi(x_i)^\top \beta^k > \theta^k$, and to their lower child otherwise. Terminal elements are "dummy" structures and are not classifiers. They return the predictions of their direct parent classifier nodes, essentially functioning as a placeholder for an exit out of the tree. The tree structure may be a full balanced binary tree of some depth (e.g., Figure 1), or can be pruned based on a validation set (e.g., Figure 4, left).

¹ For example, in web-search ranking, features generated by browser statistics are typically predictive only for highly relevant pages as they require the user to spend significant time on the page and interact with it.

During test-time, inputs are first applied to the root node $v^0$. The root node produces predictions $\phi(x_i)^\top \beta^0$ and sends the input $x_i$ along one of two different paths, depending on whether $\phi(x_i)^\top \beta^0 > \theta^0$. By repeatedly branching the test-inputs, classifier nodes sitting deeper in the tree only handle a small subset of all inputs and become specialized towards that subset of the input space.
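The test-time procedure described above can be sketched as a short loop. This is an illustrative sketch under simplifying assumptions, not the authors' implementation: the `Node` class and `predict` function are hypothetical names, a missing child stands in for a terminal element, and only weak learner outputs are cached (feature caching would work the same way).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass
class Node:
    beta: List[float]                # weights over weak learners
    theta: float                     # branching threshold
    upper: Optional["Node"] = None   # child taken when score > theta
    lower: Optional["Node"] = None   # child taken otherwise


def predict(root: Node, x: Sequence[float],
            weak_learners: List[Callable[[Sequence[float]], float]]
            ) -> Tuple[float, List[int]]:
    """Route x through the tree; return the last node's score and the
    indices of the weak learners that were actually evaluated."""
    cache: Dict[int, float] = {}
    node = root
    while True:
        score = 0.0
        for t, b in enumerate(node.beta):
            if b != 0.0:
                if t not in cache:       # evaluate each weak learner at most once
                    cache[t] = weak_learners[t](x)
                score += b * cache[t]
        child = node.upper if score > node.theta else node.lower
        if child is None:                # terminal element: return parent's prediction
            return score, sorted(cache)
        node = child


# Hypothetical weak learners and a depth-2 tree.
learners = [lambda x: x[0], lambda x: x[1], lambda x: x[0] * x[1]]
root = Node(beta=[1.0, 0.0, 0.0], theta=0.0,
            upper=Node(beta=[0.0, 1.0, 0.0], theta=0.0),
            lower=Node(beta=[1.0, 0.0, 1.0], theta=0.0))
up_result = predict(root, [2.0, 3.0], learners)     # (3.0, [0, 1])
down_result = predict(root, [-1.0, 4.0], learners)  # (-5.0, [0, 2])
```

The two inputs follow different paths and therefore evaluate different weak learners; weak learner 0 is reused by the lower child without being recomputed.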

4.1. Tree loss 

We derive a single global loss function over all nodes in the 
CSTC tree. 

Soft tree traversal. Training the CSTC tree with hard thresholds leads to a combinatorial optimization problem, which is NP-hard. Therefore, during training, we softly partition the inputs and assign traversal probabilities $p(v^k|x_i)$ to denote the likelihood of input $x_i$ traversing through node $v^k$. Every input $x_i$ traverses through the root, so we define $p(v^0|x_i) = 1$ for all $i$. We use the sigmoid function to define a soft belief that an input $x_i$ will transition from classifier node $v^k$ to its upper child $v^j$ as $p(v^j|x_i, v^k) = \sigma(\phi(x_i)^\top \beta^k - \theta^k)$.² The probability of reaching child $v^j$ from the root is, recursively, $p(v^j|x_i) = p(v^j|x_i, v^k)\,p(v^k|x_i)$, because each node has exactly one parent. For a lower child $v^l$ of parent $v^k$ we naturally obtain $p(v^l|x_i) = [1 - p(v^j|x_i, v^k)]\,p(v^k|x_i)$. In the following paragraphs we incorporate this probabilistic framework into the single-node risk and cost terms of eq. (4) to obtain the corresponding expected tree risk and tree cost.
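The recursive definition of the traversal probabilities can be sketched as follows. The node numbering, the dictionary-based tree layout, and the function names are assumptions made for this example; node 0 plays the role of the root.

```python
import math
from typing import Dict, Tuple


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def traversal_probabilities(scores: Dict[int, float], thetas: Dict[int, float],
                            children: Dict[int, Tuple[int, int]]) -> Dict[int, float]:
    """Soft routing for one input.

    scores[k]   : phi(x)^T beta^k at classifier node k
    thetas[k]   : threshold of classifier node k
    children[k] : (upper child id, lower child id); ids absent from
                  `children` are treated as terminal elements
    """
    p = {0: 1.0}                      # every input passes through the root
    stack = [0]
    while stack:
        k = stack.pop()
        if k not in children:
            continue                  # terminal element, nothing to split
        upper, lower = children[k]
        go_up = sigmoid(scores[k] - thetas[k])
        p[upper] = go_up * p[k]
        p[lower] = (1.0 - go_up) * p[k]
        stack.extend((upper, lower))
    return p


# Hypothetical depth-2 tree: classifier nodes 0, 1, 2 and terminal elements 3..6.
children = {0: (1, 2), 1: (3, 4), 2: (5, 6)}
scores = {0: 0.0, 1: 2.0, 2: -1.0}
thetas = {0: 0.0, 1: 0.0, 2: 0.0}
p = traversal_probabilities(scores, thetas, children)
terminal_mass = sum(p[leaf] for leaf in (3, 4, 5, 6))
```

Because each split distributes the parent's probability between its two children, the probabilities of the terminal elements sum to one for every input.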

Expected tree risk. The expected tree risk can be obtained by summing over all nodes $V$ and inputs and weighing the risk $\ell(\cdot)$ of input $x_i$ at node $v^k$ by the probability $p_i^k = p(v^k|x_i)$,

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{v^k \in V} p_i^k\, \ell(\phi(x_i)^\top \beta^k, y_i). \tag{6}$$

This has two effects: 1. the local risk for each node focuses more on likely inputs; 2. the global risk attributes more weight to classifiers that serve many inputs.

² The sigmoid function is defined as $\sigma(a) = \frac{1}{1 + \exp(-a)}$ and takes advantage of the fact that $\sigma(a) \in [0, 1]$ and that $\sigma(\cdot)$ is strictly monotonic.

Expected tree costs. The cost of a test-input is the cumulative cost across all classifiers along its path through the CSTC tree. Figure 1 illustrates an example of a CSTC tree with all paths highlighted in color. Every test-input must follow along exactly one of the paths from the root to a terminal element. Let $L$ denote the set of all terminal elements (in figure 1, $L$ contains four terminal elements), and for any $v^l \in L$ let $\pi^l$ denote the set of all classifier nodes along the unique path from the root $v^0$ before terminal element $v^l$ (in figure 1, each such path contains three classifier nodes). The evaluation and feature cost of this unique path is exactly

$$c^l = \underbrace{\sum_t e_t \Big\| \sum_{v^j \in \pi^l} |\beta_t^j| \Big\|_0}_{\text{evaluation cost}} + \underbrace{\sum_\alpha c_\alpha \Big\| \sum_{v^j \in \pi^l} \sum_t |F_{\alpha t}\beta_t^j| \Big\|_0}_{\text{feature cost}}.$$

This term is analogous to eq. (2), except the cost $e_t$ of the weak learner $h_t$ is paid if any of the classifiers $v^j$ in path $\pi^l$ use this tree (i.e. assign $\beta_t^j$ non-zero weight). Similarly, the cost $c_\alpha$ of a feature $f_\alpha$ is paid exactly once if any of the weak learners of any of the classifiers along $\pi^l$ require it. Once computed, a feature or weak learner can be reused by all classifiers along the path for free (as the computation can be cached very efficiently).

Given an input $x_i$, the probability of reaching terminal element $v^l \in L$ (traversing along path $\pi^l$) is $p_i^l = p(v^l|x_i)$. Therefore, the marginal probability that a training input (picked uniformly at random from the training set) reaches $v^l$ is $p^l = \sum_i p(v^l|x_i)\,p(x_i) = \frac{1}{n}\sum_{i=1}^{n} p_i^l$. With this notation, the expected cost for an input traversing the CSTC tree becomes $E[c^l] = \sum_{v^l \in L} p^l c^l$. Using our $l_0$-norm relaxation in eq. (3) on both $l_0$ norms in $c^l$ gives the final expected tree cost penalty

$$\sum_{v^l \in L} p^l \Bigg[ \sum_t e_t \sqrt{\sum_{v^j \in \pi^l} (\beta_t^j)^2} + \sum_\alpha c_\alpha \sqrt{\sum_{v^j \in \pi^l} \sum_t (F_{\alpha t}\beta_t^j)^2} \Bigg],$$

which naturally encourages weak learner and feature re-use along paths through the CSTC tree.
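The expected tree cost penalty can be evaluated directly from its definition. The sketch below does so for a hypothetical two-path tree; function names, node ids, and all numbers are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def path_cost_penalty(path_betas: List[Sequence[float]], eval_cost: Sequence[float],
                      feat_cost: Sequence[float], F: List[List[int]]) -> float:
    """Relaxed evaluation + feature cost of one root-to-terminal path."""
    T = len(eval_cost)
    evaluation = sum(
        eval_cost[t] * math.sqrt(sum(beta[t] ** 2 for beta in path_betas))
        for t in range(T)
    )
    extraction = sum(
        c * math.sqrt(sum((row[t] * beta[t]) ** 2
                          for beta in path_betas for t in range(T)))
        for c, row in zip(feat_cost, F)
    )
    return evaluation + extraction


def expected_tree_cost_penalty(paths: Dict[str, List[int]], reach_prob: Dict[str, float],
                               betas: Dict[int, Sequence[float]],
                               eval_cost: Sequence[float], feat_cost: Sequence[float],
                               F: List[List[int]]) -> float:
    """Sum over terminal elements of p^l times the relaxed cost of path pi^l."""
    return sum(
        reach_prob[l] * path_cost_penalty([betas[k] for k in nodes],
                                          eval_cost, feat_cost, F)
        for l, nodes in paths.items()
    )


# Two weak learners, each using its own feature (F is the identity).
F = [[1, 0], [0, 1]]
e = [1.0, 1.0]
c = [1.0, 10.0]
betas = {0: [3.0, 0.0], 1: [4.0, 0.0], 2: [0.0, 2.0]}
paths = {"a": [0, 1], "b": [0, 2]}     # classifier nodes on each path
reach_prob = {"a": 0.9, "b": 0.1}      # marginal probabilities p^l
penalty = expected_tree_cost_penalty(paths, reach_prob, betas, e, c, F)  # 11.8
```

On path "a" both classifiers use weak learner 0, and the shared square root charges $\sqrt{3^2 + 4^2} = 5$ rather than $3 + 4 = 7$, which illustrates how the penalty favors re-use along a path. The expensive feature appears only on the rarely traveled path "b", so it contributes little to the expected cost.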

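Optimization problem. We combine the risk (6) with the cost penalties and add the $l_1$-regularization term (which is unaffected by our probabilistic splitting) to obtain the global optimization problem

$$\min_{\{\beta^k, \theta^k\}_{v^k \in V}} \underbrace{\sum_{v^k \in V} \Big( \frac{1}{n} \sum_{i=1}^{n} p_i^k \ell_i^k + \rho|\beta^k| \Big)}_{\text{regularized risk}} + \lambda \sum_{v^l \in L} p^l \Bigg[ \underbrace{\sum_t e_t \sqrt{\sum_{v^j \in \pi^l} (\beta_t^j)^2}}_{\text{evaluation cost penalty}} + \underbrace{\sum_\alpha c_\alpha \sqrt{\sum_{v^j \in \pi^l} \sum_t (F_{\alpha t}\beta_t^j)^2}}_{\text{feature cost penalty}} \Bigg]. \tag{5}$$

(We abbreviate the empirical risk at node $v^k$ as $\ell_i^k = \ell(\phi(x_i)^\top \beta^k, y_i)$.)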

4.2. Optimization Details 

There are many techniques to minimize the loss in (5). We use a cyclic optimization procedure, minimizing with respect to the parameters $(\beta^k, \theta^k)$ of each classifier node $v^k$ one at a time, keeping all other nodes fixed. For a given classifier node $v^k$, the traversal probabilities $p_i^j$ of a descendant node $v^j$ and the probability $p^l$ of an instance reaching a terminal element $v^l$ also depend on $\beta^k$ and $\theta^k$ (through its recursive definition) and must be incorporated into the gradient computation.

To minimize (5) with respect to parameters $\beta^k, \theta^k$, we use the lemma below to overcome the non-differentiability of the square-root terms (and $l_1$ norms) resulting from the $l_0$-relaxations (3).



Lemma 1. Given any $g(x) > 0$, the following holds:

$$\sqrt{g(x)} = \min_{z > 0} \frac{1}{2} \left[ \frac{g(x)}{z} + z \right]. \tag{7}$$

The lemma can be proved as $z = \sqrt{g(x)}$ minimizes the function on the right hand side. Further, it is shown in (Boyd & Vandenberghe, 2004) that the right hand side is jointly convex in $x$ and $z$, so long as $g(x)$ is convex.
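Lemma 1 is easy to check numerically. The short sketch below (names are ours) evaluates the right hand side of (7) on a grid of $z$ values and at the closed-form minimizer $z = \sqrt{g(x)}$, for one arbitrary value of $g(x)$.

```python
import math


def upper_bound(g: float, z: float) -> float:
    """Right hand side of (7) before minimizing over z > 0."""
    return 0.5 * (g / z + z)


g = 9.0
z_star = math.sqrt(g)                      # closed-form minimizer
value_at_minimizer = upper_bound(g, z_star)

# No z on a grid over (0, 10] does better than z_star (up to rounding).
grid = [0.1 * i for i in range(1, 101)]
best_on_grid = min(upper_bound(g, z) for z in grid)
```

Setting the derivative $-g/z^2 + 1$ to zero gives $z = \sqrt{g}$, at which the bound equals $\frac{1}{2}(\sqrt{g} + \sqrt{g}) = \sqrt{g}$.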

For each square-root or $l_1$ term we introduce an auxiliary variable (i.e., $z$ above) and alternate between minimizing the loss in (5) with respect to $\beta^k, \theta^k$ and the auxiliary variables. The former is performed with conjugate gradient descent and the latter can be computed efficiently in closed form. This pattern of block-coordinate descent followed by a closed form minimization is repeated until convergence. Note that the loss is guaranteed to converge to a fixed point because each iteration decreases the loss function, which is bounded below by 0.
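The alternating scheme can be illustrated on a one-dimensional toy problem, $\min_\beta \frac{1}{2}(\beta - a)^2 + \lambda|\beta|$, writing $|\beta| = \sqrt{\beta^2}$ and applying Lemma 1. This is a sketch under our own simplifying assumptions: the toy objective, the function name, and the small floor `eps` on the auxiliary variable are not from the paper, and the weight update is available in closed form here, whereas the paper uses conjugate gradient descent for that step.

```python
from typing import List, Tuple


def alternating_min(a: float, lam: float, iters: int = 100,
                    eps: float = 1e-12) -> Tuple[float, List[float]]:
    """Minimize 0.5*(beta - a)^2 + lam*|beta| with an auxiliary variable z."""
    beta = a                          # start from the unregularized solution
    losses = []
    for _ in range(iters):
        # Auxiliary update in closed form: z = sqrt(beta^2), floored for stability.
        z = max(abs(beta), eps)
        # Weight update: minimizer of 0.5*(beta - a)^2 + 0.5*lam*(beta^2 / z + z).
        beta = a / (1.0 + lam / z)
        losses.append(0.5 * (beta - a) ** 2 + lam * abs(beta))
    return beta, losses


beta_large, losses_large = alternating_min(a=3.0, lam=1.0)   # approaches 2.0
beta_small, _ = alternating_min(a=0.5, lam=1.0)              # approaches 0.0
```

The iterates approach the soft-thresholded solution $\mathrm{sign}(a)\max(|a| - \lambda, 0)$, and the recorded losses never increase, mirroring the convergence argument above. Weights that should be zero are driven towards zero without reaching it exactly, so in practice a small threshold would be needed to declare a weight inactive.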

Initialization. The minimization of eq. (5) is non-convex and therefore initialization dependent. However, minimizing eq. (5) with respect to the parameters of leaf classifier nodes is convex, as the loss function, after substitutions based on lemma 1, becomes jointly convex (because of the lack of descendant nodes). We therefore initialize the tree top-to-bottom, starting at $v^0$, and optimize over $\beta^k$ by minimizing (5) while considering all descendant nodes of $v^k$ as "cut-off" (thus pretending node $v^k$ is a leaf).

Tree pruning. To obtain a more compact model and to avoid overfitting, the CSTC tree can be pruned with the help of a validation set. As each node is a classifier, we can apply the CSTC tree on a validation set and compute the validation error at each node. We prune away nodes that, upon removal, do not decrease the performance of CSTC on the validation set (in the case of ranking data, we can even use validation NDCG as our pruning criterion).

Fine-tuning. The relaxation in (3) makes the exact $l_0$ cost terms differentiable and is well suited to approximate which dimensions in a vector $\beta^k$ should be assigned non-zero weights. The mixed-norm does however impact the performance of the classifiers because (different from the $l_0$ norm) larger weights in $\beta$ incur larger penalties in the loss. We therefore introduce a post-processing step to correct the classifiers from this unwanted regularization effect. We re-optimize all predictive classifiers (classifiers with terminal element children, i.e. classifiers that make final predictions), while clamping all features with zero-weight to strictly remain zero:

$$\min_{\bar\beta^k} \frac{1}{n} \sum_{i=1}^{n} p_i^k\, \ell(\phi(x_i)^\top \bar\beta^k, y_i) + \rho|\bar\beta^k| \quad \text{subject to: } \bar\beta_t^k = 0 \text{ if } \beta_t^k = 0.$$

The final CSTC tree uses these re-optimized weight vectors $\bar\beta^k$ for all predictive classifier nodes $v^k$.

5. Results 

In this section, we first evaluate CSTC on a carefully con- 
structed synthetic data set to test our hypothesis that CSTC 
learns specialized classifiers that rely on different feature 
subsets. We then evaluate the performance of CSTC on the 
large scale Yahoo! Learning to Rank Challenge data set 
and compare it with state-of-the-art algorithms. 

5.1. Synthetic data 

We construct a synthetic regression dataset, sampled from the four quadrants of the $X,Z$-plane, where $X = Z = [-1, 1]$. The features belong to two categories: cheap features, $\mathrm{sign}(x)$, $\mathrm{sign}(z)$ with cost $c = 1$, which can be used to identify the quadrant of an input; and four expensive features $y_{++}, y_{-+}, y_{+-}, y_{--}$ with cost $c = 10$, which represent the exact label of an input if it is from the corresponding region (a random number otherwise). Since in this synthetic data set we do not transform the feature space, we have $\phi(x) = x$, and $F$ (the weak learner feature-usage variable) is the $6 \times 6$ identity matrix. By design, a perfect classifier can use the two cheap features to identify the sub-region of an instance and then extract the correct expensive feature to make a perfect prediction. The minimum feature cost of such a perfect classifier is exactly $c = 12$ per instance (two cheap features plus one expensive feature). The labels are sampled from Gaussian distributions with quadrant-specific means $\mu_{++}, \mu_{-+}, \mu_{+-}, \mu_{--}$ and variance 1. Figure 2 shows the CSTC tree and the predictions of test inputs made by each node. In every path along the tree, the first two classifiers split on the two cheap features and identify the correct sub-region of the input. The final classifier extracts a single expensive feature to predict the labels. As such, the mean squared error of the training and testing data both approach 0.

Figure 2. CSTC on synthetic data. The box at left describes the artificial data set. The rest of the figure shows the CSTC tree built for the data set. At each node we show a plot of the predictions made by that classifier. After each node we show the weight vector that was selected to make predictions and send instances to child nodes (if applicable).

Figure 3. The test ranking accuracy (NDCG@5) and cost of various cost-sensitive classifiers. CSTC maintains its high retrieval accuracy significantly longer as the cost-budget is reduced.
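The synthetic construction of this subsection can be sketched as a small generator together with the "perfect classifier" described in the text. The quadrant means below are hypothetical placeholders (the actual values are not recoverable from the text), as are the function names and the use of uniform noise for the uninformative expensive features.

```python
import random
from typing import Dict, Tuple

COST_CHEAP, COST_EXPENSIVE = 1, 10
# Hypothetical quadrant-specific means, keyed by (sign of x, sign of z).
MEANS = {("+", "+"): 2.0, ("-", "+"): -2.0, ("+", "-"): 4.0, ("-", "-"): -4.0}

Quadrant = Tuple[str, str]


def sample(rng: random.Random):
    """Draw one instance: two cheap sign features, four expensive features, a label."""
    x, z = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    quad = ("+" if x >= 0 else "-", "+" if z >= 0 else "-")
    label = rng.gauss(MEANS[quad], 1.0)          # variance 1
    cheap = {"sign_x": 1.0 if x >= 0 else -1.0,
             "sign_z": 1.0 if z >= 0 else -1.0}
    # The expensive feature of the matching quadrant equals the label,
    # the other three are random numbers.
    expensive: Dict[Quadrant, float] = {
        q: (label if q == quad else rng.random()) for q in MEANS
    }
    return cheap, expensive, label


def oracle(cheap: Dict[str, float], expensive: Dict[Quadrant, float]):
    """Use both cheap features to find the quadrant, then one expensive feature."""
    quad = ("+" if cheap["sign_x"] > 0 else "-", "+" if cheap["sign_z"] > 0 else "-")
    cost = 2 * COST_CHEAP + COST_EXPENSIVE       # 1 + 1 + 10
    return expensive[quad], cost


rng = random.Random(0)
data = [sample(rng) for _ in range(100)]
results = [(oracle(cheap, exp), label) for cheap, exp, label in data]
```

Every instance is predicted exactly at a feature cost of 12, whereas a classifier that extracts all six features would pay 42 per instance.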

5.2. Yahoo! Learning to Rank 

To evaluate the performance of CSTC on real-world tasks, we test our algorithm on the public Yahoo! Learning to Rank Challenge data set³ (Chapelle & Chang, 2011). The set contains 19,944 queries and 473,134 documents. Each query-document pair consists of 519 features. An extraction cost, which takes on a value in the set {1, 5, 20, 50, 100, 150, 200}, is associated with each feature.⁴ The unit of these values is the time required to evaluate a weak learner $h_t(\cdot)$. The label $y_i \in \{4, 3, 2, 1, 0\}$ denotes the relevancy of a document to its corresponding query, with 4 indicating a perfect match. In contrast to Chen et al. (2012), we do not inflate the number of irrelevant documents (by counting them 10 times). We measure the performance using NDCG@5 (Järvelin & Kekäläinen, 2002), a preferred ranking metric when multiple levels of relevance are available. Unless otherwise stated, we restrict CSTC to a maximum of 10 nodes. All results are obtained on a desktop with two 6-core Intel i7 CPUs. Minimizing the global objective requires less than 3 hours to complete, and fine-tuning the classifiers takes about 10 minutes. Our implementation is freely available as open-source.⁵

Figure 4. (Left) The pruned CSTC-tree generated from the Yahoo! Learning to Rank data set. (Right) Jaccard similarity coefficient between classifiers within the learned CSTC tree.
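For readers unfamiliar with the metric, NDCG@5 can be computed per query as sketched below. The exponential gain $2^{rel} - 1$ and the $\log_2$ position discount are a common convention that we assume here; the text does not spell out which variant was used, and the function names are ours.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[int], k: int) -> float:
    """Discounted cumulative gain of the first k documents in the given order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances_in_predicted_order: Sequence[int], k: int = 5) -> float:
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances_in_predicted_order, reverse=True), k)
    if ideal == 0:
        return 0.0          # no relevant documents for this query
    return dcg_at_k(relevances_in_predicted_order, k) / ideal


perfect = ndcg_at_k([4, 3, 2, 1, 0])   # ideal ordering
swapped = ndcg_at_k([0, 4])            # the relevant document is ranked second
```

The score for a data set is typically the average of this quantity over all queries.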

Comparison with prior work. Figure 3 shows a comparison of CSTC with several recent algorithms for test-time cost-sensitive learning. We show NDCG versus cost (in units of weak learner evaluations). The plot shows different stages in our derivation of CSTC: the initial cost-insensitive ensemble classifier $H'(\cdot)$ (Friedman, 2001) from section 3 (stage-wise regression), a single cost-sensitive classifier as described in eq. (4), and the CSTC tree (5). We obtain the curves by varying the accuracy/cost trade-off parameter $\lambda$ (and perform early stopping based on the validation data, for fine-tuning). For the CSTC tree we evaluate six settings, $\lambda \in \{1, 2, 3, 4, 5, 6\}$. In the case of stage-wise regression, which is not cost-sensitive, the curve is simply a function of boosting iterations.

³ http://learningtorankchallenge.yahoo.com
⁴ The extraction costs were provided by a Yahoo! employee.
⁵ http://www.anonymized.com

Figure 5. (Left) The ratio of features, grouped by cost, that are extracted at different depths of CSTC (the number of features in each cost group is indicated in parentheses in the legend). More expensive features (c > 20) are gradually extracted as we go deeper. (Right) The accuracy/cost performance on the UCI Gas Sensor Array Drift data set. The result is averaged over 10 runs, and in each run, the feature cost is assigned randomly. We also plot the standard deviation of the accuracy. HME uses all features and weak learners and thus is very expensive. The accuracy of CSTC outperforms that of HME at much lower costs.

For competing algorithms, we include Early exit (Cambazoglu et al., 2010), which improves upon stage-wise regression by short-circuiting the evaluation of unpromising documents at test-time, reducing the overall test-time cost. The authors propose several criteria for rejecting inputs early and we use the best-performing method "early exits using proximity threshold". For Cronus (Chen et al., 2012), we use a cascade with a maximum of 10 nodes. All hyper-parameters (cascade length, keep ratio, discount, early-stopping) were set based on a validation set. The cost/accuracy curve was generated by varying the corresponding trade-off parameter, $\lambda$.

As shown in the graph, CSTC significantly improves the cost/accuracy trade-off curve over all other algorithms. The power of Early exit is limited in this case as the test-time cost is dominated by feature extraction, rather than the evaluation cost. Compared with Cronus, CSTC has the ability to identify features that are most beneficial to different groups of inputs. It is this ability that allows CSTC to maintain the high NDCG significantly longer as the cost-budget is reduced.

Input space partition. Figure 4 (left) shows a pruned CSTC tree ($\lambda = 4$) for the Yahoo! data set. The number above each node indicates the average label of the testing inputs passing through that node. Additionally, each node is colored by the portion of testing inputs traversing through that node. We can observe that different branches aim at different parts of the input domain. In general, the upper branches focus on correctly classifying higher ranked documents, while the lower branches target low-rank documents. Figure 4 (right) shows the Jaccard matrix of the five predictive classifiers from the same CSTC tree. The matrix shows a clear trend that the Jaccard coefficients decrease monotonically away from the diagonal. This indicates that classifiers share fewer features in common if their average labels are further apart (the two most different classifiers have only 64% of their features in common) and validates that classifiers in the CSTC tree extract different features in different regions of the tree.
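The Jaccard coefficient between two classifiers compares the sets of features they extract. A minimal sketch, assuming a feature counts as extracted when at least one weak learner with non-zero weight uses it (function names and the toy values are ours):

```python
from typing import Iterable, List, Sequence, Set


def active_features(beta: Sequence[float], F: List[List[int]]) -> Set[int]:
    """Indices of features used by at least one weak learner with non-zero weight."""
    return {alpha for alpha, row in enumerate(F)
            if any(f and b != 0.0 for f, b in zip(row, beta))}


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """|A intersect B| / |A union B|, defined as 1.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Rows of F are features, columns are weak learners (toy values).
F = [[1, 0, 1],
     [0, 1, 0],
     [0, 0, 1]]
features_a = active_features([0.7, 0.0, -0.2], F)   # {0, 2}
features_b = active_features([0.0, 1.0, 0.3], F)    # {0, 1, 2}
similarity = jaccard(features_a, features_b)        # 2 shared out of 3 in total
```

Computing this coefficient for every pair of predictive classifiers yields a symmetric matrix with ones on the diagonal, as in Figure 4 (right).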

Feature extraction. We also investigate the features ex- 
tracted in individual classifier nodes. Figure 5 (left) shows 
the fraction of features, with a particular cost, extracted at 
different depths of the CSTC tree for the Yahoo! data. We 
observe a general trend that as depth increases, more fea- 
tures are being used. However, cheap features (c < 5) 
are fully extracted early-on, whereas expensive features 
(c > 20) are extracted by classifiers sitting deeper in the 
tree, where each individual classifier only copes with a 
small subset of inputs. The expensive features are used to 
classify these subsets of inputs more precisely. The only 
feature that has cost 200 is extracted at all depths — which 
seems essential to obtain high NDCG (Chen et al., 2012). 

5.3. Comparison with HME 

Finally, we also compare against Hierarchical Mixtures of Experts (HME) (Jordan & Jacobs, 1994). Although our algorithm shares similar ideas with HME, which also trains trees of classifiers, our work is motivated by reductions in test-time cost and results in fundamentally different models. Since our HME implementation works only for binary data sets and is not optimized for large-scale data, we use the UCI Gas Sensor Array Drift data set, which we convert into a binary classification problem by taking inputs of classes 1 and 2. The data is provided in 10 batches, of which we select 6 for training, 2 for validation and 2 for testing (resulting in 2804, 1371 and 1316 inputs for each respective set). We transform the data using the boosting-trick (Chapelle et al., 2011), creating 5000 weak learners. As no real feature costs are available, we randomly pick cost $c_\alpha \in \{1, 5, 20, 50, 100, 150\}$ for each feature $f_\alpha$ and average over 10 runs with different cost assignments.

Figure 5 (right) depicts the cost/accuracy performance for both algorithms with standard deviations in accuracy. We generate the curve of CSTC by varying the accuracy/cost trade-off parameter $\lambda$. HME trains a tree of classifiers that is insensitive to the cost, and all test inputs pass through all paths. Therefore, the cost is very large, as all testing inputs use up all features and weak learners. Because the cost of features has no effect on training an HME model, the testing accuracy is a constant over the 10 runs.

It is not surprising that CSTC outperforms HME in terms 
of cost, but it is noteworthy that CSTC matches (in fact 
even slightly outperforms) the high accuracy of HME while 
building a much more cost-effective model. 

6. Conclusions 

We introduce Cost-Sensitive Tree of Classifiers (CSTC), a novel learning algorithm that explicitly addresses the trade-off between accuracy and expected test-time CPU cost in a principled fashion. The CSTC tree partitions the input space into sub-regions and identifies the most cost-effective features for each one of these regions, allowing it to match the high accuracy of the state-of-the-art at a small fraction of the cost. We obtain the CSTC algorithm by formulating the expected test-time cost of an instance passing through a tree of classifiers and relaxing it into a continuous cost function. This cost function can be minimized while learning the parameters of all classifiers in the tree jointly. By making the test-time cost vs. accuracy tradeoff explicit we enable high performance classifiers that fit into computational budgets and can reduce unnecessary energy consumption in large-scale industrial applications. Further, engineers can design highly specialized features for particular edge cases of their input domain and CSTC will automatically incorporate them on-demand into its tree structure.